Design an agent to fly a quadcopter, and then train it using a reinforcement learning algorithm of your choice!
Try to apply the techniques you have learnt, but also feel free to come up with innovative ideas and test them.
Take a look at the files in the directory to better understand the structure of the project.
- task.py: Define your task (environment) in this file.
- agents/: Folder containing reinforcement learning agents.
  - policy_search.py: A sample agent has been provided here.
  - agent.py: Develop your agent here.
- physics_sim.py: This file contains the simulator for the quadcopter. DO NOT MODIFY THIS FILE.

For this project, you will define your own task in task.py. Although we have provided an example task to get you started, you are encouraged to change it. Later in this notebook, you will learn more about how to amend this file.
You will also design a reinforcement learning agent in agent.py to complete your chosen task.
You are welcome to create any additional files to help you to organize your code. For instance, you may find it useful to define a model.py file defining any needed neural network architectures.
We provide a sample agent in the code cell below to show you how to use the sim to control the quadcopter. This agent is even simpler than the sample agent that you'll examine (in agents/policy_search.py) later in this notebook!
The agent controls the quadcopter by setting the revolutions per second on each of its four rotors. The provided agent in the Basic_Agent class below always selects a random action for each of the four rotors. These four speeds are returned by the act method as a list of four floating-point numbers.
For this project, the agent that you will implement in agents/agent.py will have a far more intelligent method for selecting actions!
import random

class Basic_Agent():
    def __init__(self, task):
        self.task = task

    def act(self):
        new_thrust = random.gauss(450., 25.)
        return [new_thrust + random.gauss(0., 1.) for x in range(4)]
Run the code cell below to have the agent select actions to control the quadcopter.
Feel free to change the provided values of runtime, init_pose, init_velocities, and init_angle_velocities below to change the starting conditions of the quadcopter.
The labels list below annotates statistics that are saved while running the simulation. All of this information is saved in a text file data.txt and stored in the dictionary results.
%load_ext autoreload
%autoreload 2
import csv
import numpy as np
from task import Task
# Modify the values below to give the quadcopter a different starting position.
runtime = 5. # time limit of the episode
init_pose = np.array([0., 0., 0., 0., 0., 0.]) # initial pose
init_velocities = np.array([0., 0., 0.]) # initial velocities
init_angle_velocities = np.array([0., 0., 0.]) # initial angle velocities
file_output = 'data.txt' # file name for saved results
# Setup
task = Task(init_pose, init_velocities, init_angle_velocities, runtime)
agent = Basic_Agent(task)
done = False
labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3', 'rotor_speed4']
results = {x : [] for x in labels}
# Run the simulation, and save the results.
with open(file_output, 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(labels)
    while True:
        rotor_speeds = agent.act()
        _, _, done = task.step(rotor_speeds)
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) + list(task.sim.angular_v) + list(rotor_speeds)
        for ii in range(len(labels)):
            results[labels[ii]].append(to_write[ii])
        writer.writerow(to_write)
        if done:
            break
Run the code cell below to visualize how the position of the quadcopter evolved during the simulation.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(results['time'], results['x'], label='x')
plt.plot(results['time'], results['y'], label='y')
plt.plot(results['time'], results['z'], label='z')
plt.legend()
_ = plt.ylim()
The next code cell visualizes the velocity of the quadcopter.
plt.plot(results['time'], results['x_velocity'], label='x_hat')
plt.plot(results['time'], results['y_velocity'], label='y_hat')
plt.plot(results['time'], results['z_velocity'], label='z_hat')
plt.legend()
_ = plt.ylim()
Next, you can plot the Euler angles (the rotation of the quadcopter over the $x$-, $y$-, and $z$-axes),
plt.plot(results['time'], results['phi'], label='phi')
plt.plot(results['time'], results['theta'], label='theta')
plt.plot(results['time'], results['psi'], label='psi')
plt.legend()
_ = plt.ylim()
before plotting the velocities (in radians per second) corresponding to each of the Euler angles.
plt.plot(results['time'], results['phi_velocity'], label='phi_velocity')
plt.plot(results['time'], results['theta_velocity'], label='theta_velocity')
plt.plot(results['time'], results['psi_velocity'], label='psi_velocity')
plt.legend()
_ = plt.ylim()
Finally, you can use the code cell below to plot the agent's choice of actions.
plt.plot(results['time'], results['rotor_speed1'], label='Rotor 1 revolutions / second')
plt.plot(results['time'], results['rotor_speed2'], label='Rotor 2 revolutions / second')
plt.plot(results['time'], results['rotor_speed3'], label='Rotor 3 revolutions / second')
plt.plot(results['time'], results['rotor_speed4'], label='Rotor 4 revolutions / second')
plt.legend()
_ = plt.ylim()
When specifying a task, you will derive the environment state from the simulator. Run the code cell below to print the values of the following variables at the end of the simulation:
- task.sim.pose (the position of the quadcopter in ($x,y,z$) dimensions and the Euler angles),
- task.sim.v (the velocity of the quadcopter in ($x,y,z$) dimensions), and
- task.sim.angular_v (radians/second for each of the three Euler angles).
# the pose, velocity, and angular velocity of the quadcopter at the end of the episode
print(task.sim.pose)
print(task.sim.v)
print(task.sim.angular_v)
In the sample task in task.py, we use the 6-dimensional pose of the quadcopter to construct the state of the environment at each timestep. However, when amending the task for your purposes, you are welcome to expand the size of the state vector by including the velocity information. You can use any combination of the pose, velocity, and angular velocity - feel free to tinker here, and construct the state to suit your task.
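As a minimal sketch of how velocity information could be appended to the pose, the snippet below concatenates the three quantities into one state list. The values of `pose`, `v`, and `angular_v` are made-up stand-ins for `task.sim.pose`, `task.sim.v`, and `task.sim.angular_v`, and `build_state` is a hypothetical helper, not part of the provided files.

```python
# Hypothetical stand-ins for task.sim.pose, task.sim.v and task.sim.angular_v.
pose = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]   # x, y, z, phi, theta, psi
v = [0.0, 0.0, 1.5]                      # linear velocity in x, y, z
angular_v = [0.0, 0.0, 0.0]              # angular velocity per Euler angle


def build_state(pose, v=None, angular_v=None):
    """Concatenate the pose with any optional velocity terms."""
    state = list(pose)
    if v is not None:
        state += list(v)
    if angular_v is not None:
        state += list(angular_v)
    return state


pose_only = build_state(pose)                 # 6 entries, as in the sample task
full_state = build_state(pose, v, angular_v)  # 12 entries
print(len(pose_only), len(full_state))
```

If the state grows this way, the size of the state vector declared by the task would need to grow to match.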
A sample task has been provided for you in task.py. Open this file in a new window now.
The __init__() method is used to initialize several variables that are needed to specify the task.
- The simulator is initialized as an instance of the PhysicsSim class (from physics_sim.py).
- For each timestep of the agent, the simulation is stepped action_repeats timesteps. If you are not familiar with action repeats, please read the Results section in the DDPG paper.
- To set the size of the state vector (state_size), we must take action repeats into account.
- The action space has one entry for each rotor (action_size=4). You can set the minimum (action_low) and maximum (action_high) values of each entry here.

The reset() method resets the simulator. The agent should call this method every time the episode ends. You can see an example of this in the code cell below.
The step() method is perhaps the most important. It accepts the agent's choice of action rotor_speeds, which is used to prepare the next state to pass on to the agent. Then, the reward is computed from get_reward(). The episode is considered done if the time limit has been exceeded, or the quadcopter has travelled outside of the bounds of the simulation.
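The flow described above can be sketched with a toy stand-in for the simulator. Everything here is an assumption made for illustration: `ToySim`, its `advance` method, the toy dynamics, and the distance-based reward are hypothetical and do not reproduce task.py or physics_sim.py.

```python
class ToySim:
    """Hypothetical stand-in for the simulator; it only tracks height and time."""

    def __init__(self, runtime=5.0, dt=0.02):
        self.runtime = runtime
        self.dt = dt
        self.time = 0.0
        self.z = 10.0

    def advance(self, rotor_speeds):
        # Toy dynamics: climb when the mean rotor speed exceeds 400, else sink.
        mean_speed = sum(rotor_speeds) / len(rotor_speeds)
        self.z += (mean_speed - 400.0) * 0.001
        self.time += self.dt
        # Done when the time limit is exceeded or the toy copter hits the ground.
        return self.time > self.runtime or self.z <= 0.0


def step(sim, rotor_speeds, target_z, action_repeats=3):
    """Apply one action for action_repeats simulator steps."""
    total_reward = 0.0
    states = []
    done = False
    for _ in range(action_repeats):
        done = sim.advance(rotor_speeds) or done
        total_reward += -abs(target_z - sim.z)   # hypothetical reward
        states.append(sim.z)                     # one observation per repeat
    return states, total_reward, done


sim = ToySim()
next_state, reward, done = step(sim, [450.0] * 4, target_z=20.0)
```

Because each repeat contributes one observation, the returned state is `action_repeats` times as long as a single observation.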
In the next section, you will learn how to test the performance of an agent on this task.
The sample agent given in agents/policy_search.py uses a very simplistic linear policy to directly compute the action vector as a dot product of the state vector and a matrix of weights. Then, it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (score), it keeps track of the best set of parameters found so far, how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
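The idea in the paragraph above can be sketched in a few lines. This is not the code in agents/policy_search.py: the function names, the halving and doubling factors, the noise limits, and the toy score function are all assumptions chosen for illustration.

```python
import random


def act(weights, state):
    """Linear policy: action = state (1 x n) times weights (n x m)."""
    n_actions = len(weights[0])
    return [sum(s * row[j] for s, row in zip(state, weights))
            for j in range(n_actions)]


def perturb(weights, noise_scale, rng):
    """Return a copy of the weights with Gaussian noise added to each entry."""
    return [[w + rng.gauss(0.0, noise_scale) for w in row] for row in weights]


def search(score_fn, weights, episodes=200, noise_scale=0.1, seed=0):
    """Keep the best weights seen; tighten noise on improvement, widen otherwise."""
    rng = random.Random(seed)
    best_w, best_score = weights, score_fn(weights)
    for _ in range(episodes):
        candidate = perturb(best_w, noise_scale, rng)
        score = score_fn(candidate)
        if score > best_score:
            best_w, best_score = candidate, score
            noise_scale = max(0.5 * noise_scale, 0.01)   # assumed lower limit
        else:
            noise_scale = min(2.0 * noise_scale, 3.2)    # assumed upper limit
    return best_w, best_score


# Toy score: how closely the policy maps a fixed state to a fixed target action.
state, target = [1.0, 2.0], [0.5]


def toy_score(weights):
    return -abs(act(weights, state)[0] - target[0])


start = [[0.0], [0.0]]
best_w, best_score = search(toy_score, start)
```

Since a candidate replaces the best weights only when it scores higher, the best score never decreases, but nothing guides the search toward better weights beyond random perturbation.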
Run the code cell below to see how the agent performs on the sample task.
import sys
import pandas as pd
from agents.policy_search import PolicySearch_Agent
from task import Task
num_episodes = 1000
target_pos = np.array([0., 0., 10.])
task = Task(target_pos=target_pos)
agent = PolicySearch_Agent(task)
for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode() # start a new episode
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="") # [debug]
            break
    sys.stdout.flush()
This agent should perform very poorly on this task. And that's where you come in!
Amend task.py to specify a task of your choosing. If you're unsure what kind of task to specify, you may like to teach your quadcopter to take off, hover in place, land softly, or reach a target pose.
After specifying your task, use the sample agent in agents/policy_search.py as a template to define your own agent in agents/agent.py. You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like act(), learn(), reset_episode(), etc.).
Note that it is highly unlikely that the first agent and task that you specify will learn well. You will likely have to tweak various hyperparameters and the reward function for your task until you arrive at reasonably good behavior.
As you develop your agent, it's important to keep an eye on how it's performing. Use the code above as inspiration to build in a mechanism to log/save the total rewards obtained in each episode to file. If the episode rewards are gradually increasing, this is an indication that your agent is learning.
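One possible logging mechanism is sketched below using only the standard library. The helper name `log_episode`, the file name, and the reward values are assumptions for illustration; in a training loop the helper would be called once per episode, when `done` is true.

```python
import csv
import os
import tempfile


def log_episode(path, episode, total_reward):
    """Append one (episode, total_reward) row, writing a header for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['episode', 'total_reward'])
        writer.writerow([episode, total_reward])


# Usage with made-up rewards, written to a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), 'rewards.csv')
for episode, total_reward in enumerate([-120.5, -98.0, -75.25], start=1):
    log_episode(log_path, episode, total_reward)

# Read the log back to check what was written.
with open(log_path, newline='') as f:
    rows = list(csv.reader(f))
```

Appending after every episode means the log survives an interrupted training run.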
The cell below checks that you have the correct version of TensorFlow and access to a GPU.
"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
from distutils.version import LooseVersion
import warnings
import tensorflow as tf
# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer. You are using {}'.format(tf.__version__)
print('TensorFlow Version: {}'.format(tf.__version__))
# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
import matplotlib.pyplot as plt

# generate plot function
def plt_dynamic(fig, sub1, sub2, x, y1, y2, color_y1='g', color_y2='b'):
    sub1.plot(x, y1, color_y1)
    sub2.plot(x, y2, color_y2)
    fig.canvas.draw()

# initialize plot
def plt_init(time_limit, y1_lower, y1_upper, y2_lower, y2_upper):
    # create plots
    fig, sub1 = plt.subplots(1, 1)
    sub2 = sub1.twinx()
    # set plot boundaries
    sub1.set_xlim(0, time_limit) # this is typically time
    sub1.set_ylim(y1_lower, y1_upper) # limits to your y1
    sub2.set_xlim(0, time_limit) # time, again
    sub2.set_ylim(y2_lower, y2_upper) # limits to your y2
    # set labels and colors for the axes
    sub1.set_xlabel('time (s)', color='k')
    sub1.set_ylabel('y1-axis label', color='g')
    sub1.tick_params(axis='x', colors='k')
    sub1.tick_params(axis='y', colors="g")
    sub2.set_ylabel('y2-axis label', color='b')
    sub2.tick_params(axis='y', colors='b')
    return fig, sub1, sub2
import matplotlib.pyplot as plt
# you must include '%matplotlib notebook' for this to work
##%matplotlib notebook

# generate plot function
def plt_dynamic(fig, sub1, sub2, x, y1, y2, color_y1='g', color_y2='b'):
    sub1.plot(x, y1, color_y1)
    sub2.plot(x, y2, color_y2)
    fig.canvas.draw()

# initialize plot; this zero-argument version reads time_limit, y1_lower,
# y1_upper, y2_lower and y2_upper from notebook-level variables
def plt_init():
    # create plots
    fig, sub1 = plt.subplots(1, 1)
    sub2 = sub1.twinx()
    # set plot boundaries
    sub1.set_xlim(0, time_limit) # this is typically time
    sub1.set_ylim(y1_lower, y1_upper) # limits to your y1
    sub2.set_xlim(0, time_limit) # time, again
    sub2.set_ylim(y2_lower, y2_upper) # limits to your y2
    # set labels and colors for the axes
    sub1.set_xlabel('time (s)', color='k')
    sub1.set_ylabel('y1-axis label', color='g')
    sub1.tick_params(axis='x', colors='k')
    sub1.tick_params(axis='y', colors="g")
    sub2.set_ylabel('y2-axis label', color='b')
    sub2.tick_params(axis='y', colors='b')
    return fig, sub1, sub2
First, train the agent on the Pendulum task from OpenAI Gym.
# you must include '%matplotlib notebook' for this to work
%matplotlib notebook
import gym
import sys
from agents.agent import DDPG
from pendulum_task import PendulumTask
# pendulum plot values
time_limit = 100
y1_lower = -50
y1_upper = 0
y2_lower = -1
y2_upper = 1
num_episodes = 1000
task = PendulumTask()
agent = DDPG(task)
pendulum_rewards = []
display_freq = 30
display_step_freq = 10
for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode() # start a new episode
    display_graph = i_episode % display_freq == 0
    if display_graph:
        # prior to the start of each episode, clear the datapoints
        x, y1, y2 = [], [], []
        fig, sub1, sub2 = plt_init(time_limit, y1_lower, y1_upper, y2_lower, y2_upper)
    step = 0
    total_reward = 0
    while True:
        step += 1
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if display_graph:
            x.append(step) # time
            y1.append(reward) # y-axis 1 values
            y2.append(next_state[0]) # y-axis 2 values
            if step % display_step_freq == 0:
                plt_dynamic(fig, sub1, sub2, x, y1, y2)
        if done:
            print("\rEpisode = {:4d}, total reward = {:7.3f}".format(
                i_episode, total_reward)) # [debug]
            pendulum_rewards.append(total_reward)
            if display_graph:
                plt_dynamic(fig, sub1, sub2, x, y1, y2)
            break
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(range(len(pendulum_rewards)), pendulum_rewards, label='total_rewards')
plt.legend()
_ = plt.ylim()
# you must include '%matplotlib notebook' for this to work
%matplotlib notebook
import sys
import numpy as np
import pandas as pd
from agents.agent import DDPG
from task import Task
# quadcopter values
time_limit = 6
y1_lower = -200
y1_upper = 0
y2_lower = 0
y2_upper = 50
num_episodes = 300
init_pose = [0., 0., 10., 0., 0., 0.]
init_velocities = [0., 0., 0.]
init_angle_velocities = [0., 0., 0.]
target_pos = np.array([0., 0., 50.])
task = Task(init_pose=init_pose, init_velocities=init_velocities,
init_angle_velocities=init_angle_velocities,target_pos=target_pos)
agent = DDPG(task)
total_rewards = []
display_freq = 20
display_step_freq = 10
for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode() # start a new episode
    display_graph = i_episode % display_freq == 0
    if display_graph:
        # prior to the start of each episode, clear the datapoints
        x, y1, y2 = [], [], []
        fig, sub1, sub2 = plt_init()
    total_reward = 0
    step = 0
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(action, reward, next_state, done)
        state = next_state
        total_reward += reward
        step += 1
        # within the episode loop
        if display_graph:
            x.append(task.sim.time) # time
            y1.append(reward) # y-axis 1 values
            y2.append(task.sim.pose[2]) # y-axis 2 values
            #if step % display_step_freq == 0:
            plt_dynamic(fig, sub1, sub2, x, y1, y2)
        if done:
            print("\rEpisode = {:4d}, total reward = {:7.3f}".format(
                i_episode, total_reward)) # [debug]
            total_rewards.append(total_reward)
            break
Once you are satisfied with your performance, plot the episode rewards, either from a single run, or averaged over multiple runs.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(range(len(total_rewards)), total_rewards, label='total_rewards')
plt.legend()
_ = plt.ylim()
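For the "averaged over multiple runs" option, a minimal sketch of the averaging step is shown here. The helper `average_runs` and the reward values are made up for illustration; each inner list stands for the per-episode total rewards of one training run.

```python
def average_runs(runs):
    """Average the total reward per episode across runs.

    zip stops at the shortest run, so runs should have equal length.
    """
    n_runs = len(runs)
    return [sum(episode_rewards) / n_runs for episode_rewards in zip(*runs)]


# Made-up rewards from two hypothetical runs of three episodes each.
runs = [[-10.0, -8.0, -6.0], [-12.0, -6.0, -4.0]]
mean_rewards = average_runs(runs)
print(mean_rewards)
```

The resulting list can be passed to `plt.plot` in place of the rewards from a single run.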
Question 1: Describe the task that you specified in task.py. How did you design the reward function?
Answer: There were two tasks that I trained the agent on. The first was the pendulum task from the OpenAI gym. It was a good first choice because its continuous reward function made it easier to train on than the mountain car task. I left the reward function as the default and created a PendulumTask class as an interface to the gym simulator. I used an action repeat of 3 and had the task run for 300 time steps per episode.
After the agent made significant progress on the pendulum task I switched over to the quadcopter task. There were a number of modifications that I made to the original task based on information in Slack and the forum. One change was to constrain all rotor speeds to the same speed. The task I was working on was take off and the simulator created a lot of instability when the rotor speeds were even slightly different. This seemed like a lot for an agent to learn in a small number of episodes. I also constrained the rotor speeds to 375 to 450 since these were the speeds where the copter would go up or down at a reasonable rate. This task also had an action repeat of 3.
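The rotor constraint described above might look like the sketch below. The function name `to_rotor_speeds` is hypothetical and only the limits 375 and 450 come from the text; how the constraint was actually wired into the task is not shown in this write-up.

```python
ROTOR_LOW, ROTOR_HIGH = 375.0, 450.0   # speed limits quoted in the answer


def to_rotor_speeds(action):
    """Clip one scalar action to the limits and copy it to all four rotors."""
    speed = min(max(action, ROTOR_LOW), ROTOR_HIGH)
    return [speed] * 4


print(to_rotor_speeds(500.0))
```

With a single shared speed the agent only has to learn one action value, which removes the instability caused by slightly different rotor speeds.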
For the takeoff task I set the initial position to [0, 0, 10] so that the simulator wouldn't think that the copter had crashed already. I set the target position to [0, 0, 100] so that the copter wouldn't run out of room to rise before the end of the episode. The reward function was set to -abs(target_z - actual_z). This rewarded the copter for getting as close as it could to the target position vertically.
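Written as a function, the reward above is a one-liner. The name `takeoff_reward` and the default target height of 100 (taken from the paragraph) are assumptions about how it might be packaged.

```python
def takeoff_reward(actual_z, target_z=100.0):
    """Negative vertical distance to the target; 0 is the best possible value."""
    return -abs(target_z - actual_z)


start_reward = takeoff_reward(10.0)     # at the starting height
at_target = takeoff_reward(100.0)       # exactly at the target height
overshoot = takeoff_reward(110.0)       # overshooting is penalized too
```

The reward is never positive, so every extra timestep away from the target lowers the episode total.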
Question 2: Discuss your agent briefly, using the following questions as a guide:
Answer: The only algorithm that I tried was DDPG since it seemed well suited to the problem. The DDPG paper explained how the algorithm could be used on a variety of control tasks in a continuous space. The paper also suggested network architectures and hyper-parameters to try.
The DDPG algorithm needed some tuning before it worked well on the pendulum and copter tasks. One problem was that the agent did not explore enough in the early stages of learning. To increase exploration I doubled the noise process parameters suggested in the paper, to theta=0.3 and sigma=0.4. This helped the agent explore more initially. Another parameter that I increased from the paper's value was tau, the soft-update rate for the target networks. The suggested value was 0.001, but with the target networks tracking the learned networks that slowly, training sessions would be very long and learning slow. I increased tau to 0.01 so that the agent could adapt faster to what it learned from the environment. The learning rate that I ended up with for both the Actor and Critic networks was 0.02. This is higher than I would normally use, but the networks seemed to learn fine even with a higher learning rate.
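To make the roles of theta, sigma, and tau concrete, here is a minimal sketch of an Ornstein-Uhlenbeck noise process and a soft target update, using the values quoted above. The class and function names are hypothetical, plain Python lists stand in for network weights, and this is not the agent's actual code.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: noise that drifts back toward mu."""

    def __init__(self, size, mu=0.0, theta=0.3, sigma=0.4, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * size

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1), applied per dimension.
        self.state = [x + self.theta * (self.mu - x)
                      + self.sigma * self.rng.gauss(0.0, 1.0)
                      for x in self.state]
        return self.state


def soft_update(local_weights, target_weights, tau=0.01):
    """Move each target weight a fraction tau toward the local weight."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(local_weights, target_weights)]


noise = OUNoise(size=1)
samples = [noise.sample()[0] for _ in range(5)]
new_target = soft_update([1.0, 2.0], [0.0, 0.0])
```

A larger sigma widens each random kick, a larger theta pulls the noise back toward mu faster, and a larger tau makes the target weights follow the learned weights more quickly.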
The neural network that I built was based on the architecture specified in the DDPG paper, which used two dense layers with 400 and 300 nodes respectively. This turned out to train extremely slowly, so I reduced the number of nodes to 200 and 150 and then added another dense layer of 100. The paper also used batch normalization, so I added that to the layers, together with dropout at a rate of 0.3. The batch normalization helped the agent learn because it put the inputs to each layer on a comparable scale across all variables. It did add considerable training time to each episode, though, so overall training was much slower. All of the layers used a ReLU activation function since those worked well in other projects.
To decrease the training time I got the agent training on the AWS GPU EC2 instance that I have been using for class. I was hoping that this would dramatically speed up the training time, but the speed increase was negligible. I ended up doing all of the training on my laptop for convenience.
Question 3: Using the episode rewards plot, discuss how the agent learned over time.
Answer: Initially, without any constraints on rotor speed or on the relationship between the rotor speeds, the task was extremely hard to train. Even after running for 1000 episodes the agent didn't seem to learn much. Once the task was constrained to a more reasonable action space, the agent was able to learn.
Most of the learning appeared to happen suddenly. The copter crashed in each of the first 170 episodes of training. Then it suddenly learned how to take off, and it took off consistently after that. The rewards in the early episodes were higher because a crash ended the episode early, so fewer per-step rewards accumulated. Because the reward function is always negative, the copter wasn't punished as much as it should have been for crashing. The rewards in the early episodes oscillated between -7600 and -7800. After the agent learned to take off the rewards were consistently at -8000.
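The effect described above can be reproduced with a small calculation. The two height trajectories below are made up for illustration and are not taken from the training runs; the reward is the -abs(target_z - actual_z) form used for the takeoff task, with an assumed target height of 100.

```python
def episode_return(heights, target_z=100.0):
    """Sum of per-step rewards -abs(target_z - z) over one episode."""
    return sum(-abs(target_z - z) for z in heights)


# Made-up trajectories: the crash ends after 3 steps, the flight lasts 5 steps.
crash = [10.0, 5.0, 0.0]
flight = [10.0, 20.0, 30.0, 40.0, 50.0]

crash_total = episode_return(crash)      # -90 - 95 - 100
flight_total = episode_return(flight)    # -90 - 80 - 70 - 60 - 50
crash_mean = crash_total / len(crash)
flight_mean = flight_total / len(flight)
```

The crashed episode has the higher total even though the flight gets closer to the target at every step; comparing the mean reward per step instead ranks the flight higher.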
Question 4: Briefly summarize your experience working on this project. You can use the following prompts for ideas.
Answer: There were many challenging aspects to this project. The most difficult part was the level of ambiguity in how to proceed with the project. Fortunately there were helpful classmates that gave suggestions on how to approach the project. Without that I would have been even more lost to begin with.
Another difficult part of the project was getting to the point that I could see what was going on while the agent trained. Again a classmate had posted some helpful code on how to create real-time plots to watch the agent train inside of an episode. Getting to the point where I could watch my agent miserably fail took a lot of effort, but once I could watch what was happening it was easier to iterate and make changes to improve it.
Overall this was by far the most difficult project of this course. It took a long time to get on the right track. Now that it is over, though, I feel like I have learned a lot about reinforcement learning because of having to try so many things and go over all of the details of the task and agent.